Introduce group_replica partial response strategy #33

Merged: hczhu-db merged 7 commits into db_main from query-fan-out-fault-tolerance on May 15, 2024

@hczhu-db (Collaborator) commented Apr 28, 2024

This PR implements the idea described in https://docs.google.com/document/d/1eGoUFNGq2pOhmQH_svBkUZ45ig8bVOZH6xi9wkNRN_E/edit#heading=h.axx9hx8o4mz5

This strategy takes effect only when the partial response tolerance flag is off. Because that flag is on by default, the new strategy is off by default.

Tested the image in dev-aws-eu-west-1 without turning on the new strategy.

How does the new strategy work?

  1. A repeatable command-line flag, endpoint, specifies multiple DNS names, such as dnssrv+db-rep0:9092, dnssrv+db-rep1:9092, and dnssrv+range-store:9092:

     endpoints := extkingpin.Addrs(cmd.Flag("endpoint", "Addresses of statically configured Thanos API servers (repeatable). The scheme may be prefixed with 'dns+' or 'dnssrv+' to detect Thanos API servers through respective DNS lookups.").
     	PlaceHolder("<endpoint>"))

  2. Querier resolves each DNS name into a set of endpoints. Each endpoint set has a replica key and a group key, both parsed from the DNS name according to a pre-defined format:

     Group key: db
     Replica key: db-rep0
     Endpoints: IP0, IP1, ...
     ---
     Group key: db
     Replica key: db-rep1
     Endpoints: IP0, IP1, ...
     ---
     Group key: range-store
     Replica key: range-store
     Endpoints: IP0, IP1, ...

  3. Querier fans out each query to all endpoints, even those already determined to be unhealthy (unlike the other strategies). It then counts the failures in each group and replica to determine whether the query result is actually complete.
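
The key parsing and failure counting described above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR. The helper names (parseKeys, failureCounts, record, complete) are hypothetical. The "-rep<N>" suffix convention is inferred from the examples above. The completeness rule (every group needs at least one replica with zero failed endpoints) is an assumption, since the description does not spell it out.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// replicaSuffix matches a trailing "-rep<N>" in a host name.
// ASSUMPTION: this naming convention is inferred from the PR examples.
var replicaSuffix = regexp.MustCompile(`-rep[0-9]+$`)

// parseKeys derives (groupKey, replicaKey) from an endpoint flag value such
// as "dnssrv+db-rep0:9092". Hypothetical helper, not part of Thanos.
func parseKeys(addr string) (groupKey, replicaKey string) {
	host := addr
	// Drop an optional DNS scheme prefix such as "dns+" or "dnssrv+".
	if i := strings.Index(host, "+"); i >= 0 {
		host = host[i+1:]
	}
	// Drop the port, if present.
	if i := strings.LastIndex(host, ":"); i >= 0 {
		host = host[:i]
	}
	replicaKey = host
	groupKey = replicaSuffix.ReplaceAllString(host, "")
	return groupKey, replicaKey
}

// failureCounts maps group key -> replica key -> number of failed endpoints.
type failureCounts map[string]map[string]int

// record registers a replica under its group and counts one failure when
// failed is true. A replica with no failures is stored with a count of 0.
func (f failureCounts) record(groupKey, replicaKey string, failed bool) {
	if f[groupKey] == nil {
		f[groupKey] = map[string]int{}
	}
	n := 0
	if failed {
		n = 1
	}
	f[groupKey][replicaKey] += n
}

// complete reports whether every group has at least one replica with zero
// failures. ASSUMPTION: this rule is illustrative, not taken from the PR.
func (f failureCounts) complete() bool {
	for _, replicas := range f {
		healthy := false
		for _, failures := range replicas {
			if failures == 0 {
				healthy = true
				break
			}
		}
		if !healthy {
			return false
		}
	}
	return true
}

func main() {
	counts := failureCounts{}

	g, r := parseKeys("dnssrv+db-rep0:9092")
	counts.record(g, r, true) // one endpoint of db-rep0 failed

	g, r = parseKeys("dnssrv+db-rep1:9092")
	counts.record(g, r, false) // db-rep1 answered fully

	g, r = parseKeys("dnssrv+range-store:9092")
	counts.record(g, r, false)

	// db-rep1 covers the "db" group, so the result counts as complete.
	fmt.Println(counts.complete())
}
```

Under this assumed rule, a failure in db-rep0 is tolerated as long as db-rep1 responds fully, while any failure in range-store makes the result incomplete because that group has only one replica.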

@hczhu-db hczhu-db force-pushed the query-fan-out-fault-tolerance branch 6 times, most recently from 89aa401 to d5b25f5 Compare April 28, 2024 05:04
@hczhu-db hczhu-db changed the title Introduce group/replica key to store clients Introduce group_replica key to store clients Apr 28, 2024
@hczhu-db hczhu-db changed the title Introduce group_replica key to store clients Introduce group_replica partial response strategy Apr 28, 2024
@hczhu-db hczhu-db force-pushed the query-fan-out-fault-tolerance branch 9 times, most recently from ae27eeb to d330354 Compare May 3, 2024 22:06
@hczhu-db hczhu-db force-pushed the query-fan-out-fault-tolerance branch from d330354 to 2259d30 Compare May 6, 2024 18:58
Review thread on cmd/thanos/query.go (outdated, resolved)
@jnyi (Collaborator) left a comment

LGTM, nice. Just a few small comments: do we have any tests to see how this behaves? Also, please make sure the e2e tests pass to avoid regressions.

@hczhu-db (Collaborator, Author) replied

> LGTM, nice. Just a few small comments: do we have any tests to see how this behaves? Also, please make sure the e2e tests pass to avoid regressions.

Some end-to-end tests are flaky on db_main.

hczhu-db added 3 commits May 14, 2024 11:38
Signed-off-by: HC Zhu (Databricks) <[email protected]>
Signed-off-by: HC Zhu (Databricks) <[email protected]>
@hczhu-db hczhu-db force-pushed the query-fan-out-fault-tolerance branch from a4e03c4 to de6ff4c Compare May 15, 2024 00:46
@hczhu-db hczhu-db force-pushed the query-fan-out-fault-tolerance branch 3 times, most recently from cd0c292 to ae79e3f Compare May 15, 2024 04:44
@hczhu-db hczhu-db force-pushed the query-fan-out-fault-tolerance branch from ae79e3f to 3e96d0d Compare May 15, 2024 05:04
@hczhu-db hczhu-db merged commit ca8da96 into db_main May 15, 2024
12 checks passed
@jnyi jnyi deleted the query-fan-out-fault-tolerance branch November 1, 2024 23:13
2 participants